This project focuses on the RFM model and applies it to customer segmentation. The RFM model is based on three quantitative factors:

- Recency (R): how recently a customer made a purchase
- Frequency (F): how often a customer makes purchases
- Monetary value (M): how much a customer spends
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
#pip install -U kaleido
import kaleido
pio.renderers.default='notebook+png'
df = pd.read_csv('Data.csv')
df.head()
|  | @timestamp | price | uid |
|---|---|---|---|
| 0 | 2021/4/11 23:59 | 48 | 94244483 |
| 1 | 2021/4/11 23:58 | 8 | 94228493 |
| 2 | 2021/4/11 23:58 | 200 | 94244423 |
| 3 | 2021/4/11 23:57 | 48 | 94244423 |
| 4 | 2021/4/11 23:56 | 48 | 94243723 |
df.dtypes
@timestamp    object
price          int64
uid            int64
dtype: object
Notice that the @timestamp column has the object dtype. We need to change it to a datetime type, keeping only the date part.
df['@timestamp'] = pd.to_datetime(df['@timestamp']).dt.floor('d')
df.head()
|  | @timestamp | price | uid |
|---|---|---|---|
| 0 | 2021-04-11 | 48 | 94244483 |
| 1 | 2021-04-11 | 8 | 94228493 |
| 2 | 2021-04-11 | 200 | 94244423 |
| 3 | 2021-04-11 | 48 | 94244423 |
| 4 | 2021-04-11 | 48 | 94243723 |
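As a quick sanity check with hypothetical timestamps, `dt.floor('d')` truncates the time of day while keeping the datetime64 dtype (`dt.normalize()` behaves the same way here):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2021-04-11 23:59', '2021-04-11 00:01']))
floored = ts.dt.floor('d')  # both become 2021-04-11 00:00:00
print(floored.dtype)        # still datetime64[ns], unlike ts.dt.date
```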
Also notice that uid has an integer dtype. Since uid is an identifier rather than a quantity, we change it to the category dtype.
df['uid'] = df['uid'].astype('category')
# dataframe info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   @timestamp  100000 non-null  datetime64[ns]
 1   price       100000 non-null  int64
 2   uid         100000 non-null  category
dtypes: category(1), datetime64[ns](1), int64(1)
memory usage: 4.3 MB
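The category dtype pays off when a column contains many repeated values: pandas stores each unique value once plus compact integer codes. A small sketch with made-up ids:

```python
import pandas as pd

# Two unique ids repeated many times: category stores each value once
# plus small integer codes, instead of a full int64 per row
s_int = pd.Series([94244483, 94228493] * 50_000)
s_cat = s_int.astype('category')
print(s_int.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```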
# numerical column statistical information
df.describe()
|  | price |
|---|---|
| count | 100000.000000 |
| mean | 187.057440 |
| std | 484.823679 |
| min | 8.000000 |
| 25% | 8.000000 |
| 50% | 48.000000 |
| 75% | 240.000000 |
| max | 5184.000000 |
fig = px.histogram(df, x='price', nbins=20, width=800, height=400)
fig.show()
From the above histogram, we can see that most customers spend \$0-\$400 on games.
# calculate F, M scores by user
df1 = df.copy()
df1 = df1.groupby('uid').agg(
    last_date=('@timestamp', 'max'),
    F_score=('@timestamp', 'count'),
    M_score=('price', 'sum')).reset_index()
df1.head()
|  | uid | last_date | F_score | M_score |
|---|---|---|---|---|
| 0 | 1000023 | 2021-04-02 | 1 | 240 |
| 1 | 1000063 | 2021-04-06 | 1 | 8 |
| 2 | 1000583 | 2021-03-29 | 1 | 8 |
| 3 | 1000873 | 2021-03-28 | 1 | 8 |
| 4 | 1000923 | 2021-03-12 | 2 | 96 |
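The named-aggregation call above can be sanity-checked on a tiny hypothetical log:

```python
import pandas as pd

# Hypothetical three-transaction log for two users
toy = pd.DataFrame({
    'uid': [1, 1, 2],
    'price': [48, 8, 200],
    '@timestamp': pd.to_datetime(['2021-04-01', '2021-04-10', '2021-04-05']),
})

# Named aggregation: purchase count (F) and total spend (M) per user
out = toy.groupby('uid').agg(
    F_score=('@timestamp', 'count'),
    M_score=('price', 'sum'))
print(out)  # uid 1 -> F=2, M=56; uid 2 -> F=1, M=200
```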
df['@timestamp'].max()
Timestamp('2021-04-11 00:00:00')
# calculate R score based on last date
df1['R_score'] = (df['@timestamp'].max() - df1['last_date']).dt.days
df1 = df1[['uid', 'last_date', 'R_score', 'F_score', 'M_score']] #adjust the column orders
df1.head()
|  | uid | last_date | R_score | F_score | M_score |
|---|---|---|---|---|---|
| 0 | 1000023 | 2021-04-02 | 9 | 1 | 240 |
| 1 | 1000063 | 2021-04-06 | 5 | 1 | 8 |
| 2 | 1000583 | 2021-03-29 | 13 | 1 | 8 |
| 3 | 1000873 | 2021-03-28 | 14 | 1 | 8 |
| 4 | 1000923 | 2021-03-12 | 30 | 2 | 96 |
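The recency arithmetic can be verified against a couple of dates from the table above (snapshot date 2021-04-11, the latest transaction date in the data):

```python
import pandas as pd

snapshot = pd.Timestamp('2021-04-11')
last = pd.Series(pd.to_datetime(['2021-04-02', '2021-03-12']))

# Whole days elapsed between each last purchase and the snapshot date
days_since = (snapshot - last).dt.days
print(days_since.tolist())  # [9, 30]
```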
df1.describe()
|  | R_score | F_score | M_score |
|---|---|---|---|
| count | 51477.000000 | 51477.000000 | 51477.000000 |
| mean | 20.021641 | 1.942615 | 363.380617 |
| std | 12.200368 | 2.456635 | 1848.309668 |
| min | 0.000000 | 1.000000 | 8.000000 |
| 25% | 9.000000 | 1.000000 | 48.000000 |
| 50% | 21.000000 | 1.000000 | 48.000000 |
| 75% | 30.000000 | 2.000000 | 240.000000 |
| max | 41.000000 | 102.000000 | 137392.000000 |
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=1, cols=3, subplot_titles=("R_score", "F_score", "M_score"))
fig.add_trace(
    go.Histogram(x=df1['R_score'], nbinsx=10),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=df1['F_score'], nbinsx=10),
    row=1, col=2
)
fig.add_trace(
    go.Histogram(x=df1['M_score'], nbinsx=10),
    row=1, col=3
)
fig.update_layout(title_text="Multiple Subplots for R,F,M score", showlegend=False)
fig.show()
From the above subplots, we can see that R_score is distributed relatively evenly, while F_score and M_score are highly skewed: most customers fall in the \$0-\$19.9k spend group and the 0-19 purchase-frequency group.
Even though we could simply label a customer 'high value' or 'low value' relative to the mean, here I use the K-Means machine learning algorithm to segment customers. Common metrics for choosing the number of clusters include:

- distortion: mean sum of squared distances from points to their cluster centers
- silhouette: mean ratio of intra-cluster distance to nearest-cluster distance
- calinski_harabasz: ratio of between-cluster dispersion to within-cluster dispersion

I use the distortion score in this project.
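In the code below, the distortion comes from scikit-learn's `inertia_` attribute, i.e. the sum of squared distances from each point to its assigned cluster center. A minimal check on toy data that `inertia_` really is that quantity:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated pairs of points on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute the distortion by hand: squared distance from each point
# to the center of the cluster it was assigned to
assigned_centers = km.cluster_centers_[km.labels_]
manual = ((X - assigned_centers) ** 2).sum()
print(manual, km.inertia_)  # both 1.0 (centers at 0.5 and 10.5)
```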
# define show_elbow function
from sklearn.cluster import KMeans
import plotly.express as px
def show_elbow(df):
    distance_list = []
    K = range(1, 9)
    for k in K:
        kmeans = KMeans(n_clusters=k, random_state=4, max_iter=100)
        kmeans = kmeans.fit(df)  # train the model
        distance_list.append(kmeans.inertia_)  # distortion score
    fig = px.line(x=K, y=distance_list, width=800, height=400,
                  title='elbow curve for k-means',
                  labels=dict(x='K value', y='distortion score'))
    fig.show()
show_elbow(df1[['R_score']])
show_elbow(df1[['F_score']])
show_elbow(df1[['M_score']])
show_elbow(df1[['R_score', 'F_score','M_score']])
From the elbow curves, we can see that the best k for the R, F, M features is around 2-3. Next I use the Python package kneed to identify the elbow point programmatically.
#! pip install kneed
distance_list = []
K = range(1,9)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=4, max_iter=100)
    kmeans = kmeans.fit(df1[['R_score', 'F_score', 'M_score']])  # train the model
    distance_list.append(kmeans.inertia_)
print(distance_list)
[175862787189.92487, 86424663251.37645, 48909711419.30495, 29071362196.4047, 17829490781.70125, 12101574040.75576, 8349599368.586826, 6421270038.266933]
from kneed import KneeLocator
# find knee
kneedle = KneeLocator(x=K, y=distance_list, curve= 'convex', direction='decreasing')
kneedle.plot_knee()
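A rough approximation of what KneeLocator does can be sketched by hand: normalize both axes and find where the curve dips farthest below the straight line joining its endpoints. On the distortion values printed above this reproduces the same answer:

```python
import numpy as np

# Distortion values from the loop above
distortions = np.array([175862787189.9, 86424663251.4, 48909711419.3,
                        29071362196.4, 17829490781.7, 12101574040.8,
                        8349599368.6, 6421270038.3])
K = np.arange(1, 9)

# Normalize both axes to [0, 1]
x = (K - K.min()) / (K.max() - K.min())
y = (distortions - distortions.min()) / (distortions.max() - distortions.min())

# Vertical distance from the straight line through the endpoints (y = 1 - x)
# down to the curve; the elbow is where this distance is largest
diff = (1 - x) - y
elbow = K[diff.argmax()]
print(elbow)  # 3
```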
We choose 3 as the best k value, so we segment customers into 3 clusters (High, Middle, Low).
from sklearn.cluster import KMeans
# feature matrix
X = df1[['R_score', 'F_score', 'M_score']]
# build the KMeans model with k=3
kmeans = KMeans(n_clusters=3, random_state=4)
# fit the model on the raw (unscaled) features
kmeans.fit(X)
KMeans(n_clusters=3, random_state=4)
df1['RFM_level'] = kmeans.predict(X)
df1.sample(10)
|  | uid | last_date | R_score | F_score | M_score | RFM_level |
|---|---|---|---|---|---|---|
| 8464 | 42145783 | 2021-03-20 | 22 | 7 | 16424 | 2 |
| 20984 | 77658913 | 2021-03-06 | 36 | 1 | 48 | 0 |
| 28881 | 81541103 | 2021-03-21 | 21 | 1 | 240 | 0 |
| 46217 | 90254593 | 2021-04-03 | 8 | 2 | 56 | 0 |
| 47696 | 91265333 | 2021-04-05 | 6 | 1 | 240 | 0 |
| 45363 | 89664833 | 2021-04-07 | 4 | 1 | 48 | 0 |
| 30911 | 82496513 | 2021-03-29 | 13 | 3 | 104 | 0 |
| 13753 | 70718413 | 2021-03-14 | 28 | 3 | 256 | 0 |
| 11718 | 62137213 | 2021-03-06 | 36 | 1 | 8 | 0 |
| 18499 | 76497473 | 2021-03-04 | 38 | 1 | 48 | 0 |
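Note that the model above is fit on the raw features, so M_score, with its much larger scale, dominates the Euclidean distances and hence the clustering. If all three scores should carry comparable weight, the features could be standardized first (this would change the cluster assignments shown above). A hedged sketch with synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic features where one column has a much larger scale, as M_score does
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(20, 12, 300),      # R-like
                         rng.normal(2, 1, 300),        # F-like
                         rng.normal(360, 1800, 300)])  # M-like

X_scaled = StandardScaler().fit_transform(X_toy)
km = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X_scaled)

# After standardization each feature has unit variance, so all three
# contribute comparably to the distances KMeans minimizes
print(np.round(X_scaled.std(axis=0), 6))  # [1. 1. 1.]
```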
# sort by sum of R,F,M scores mean
df1_new = (df1.groupby('RFM_level')[['R_score', 'F_score', 'M_score']]
              .mean().sum(axis=1)
              .reset_index(name='total')
              .sort_values(by='total')
              .reset_index(drop=True))
df1_new
|  | RFM_level | total |
|---|---|---|
| 0 | 0 | 287.988449 |
| 1 | 2 | 13642.594595 |
| 2 | 1 | 70864.866667 |
# the sorted position becomes the rank
df1_new['Rank'] = df1_new.index
df1_new
|  | RFM_level | total | Rank |
|---|---|---|---|
| 0 | 0 | 287.988449 | 0 |
| 1 | 2 | 13642.594595 | 1 |
| 2 | 1 | 70864.866667 | 2 |
# merge two dataframes
df1_new = pd.merge(df1, df1_new[['RFM_level', 'Rank']], on = 'RFM_level')
df1_new
|  | uid | last_date | R_score | F_score | M_score | RFM_level | Rank |
|---|---|---|---|---|---|---|---|
| 0 | 1000023 | 2021-04-02 | 9 | 1 | 240 | 0 | 0 |
| 1 | 1000063 | 2021-04-06 | 5 | 1 | 8 | 0 | 0 |
| 2 | 1000583 | 2021-03-29 | 13 | 1 | 8 | 0 | 0 |
| 3 | 1000873 | 2021-03-28 | 14 | 1 | 8 | 0 | 0 |
| 4 | 1000923 | 2021-03-12 | 30 | 2 | 96 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 51472 | 76662313 | 2021-04-11 | 0 | 29 | 61016 | 1 | 2 |
| 51473 | 77131663 | 2021-03-06 | 36 | 11 | 49264 | 1 | 2 |
| 51474 | 78431723 | 2021-03-11 | 31 | 10 | 49280 | 1 | 2 |
| 51475 | 80074703 | 2021-03-12 | 30 | 14 | 59008 | 1 | 2 |
| 51476 | 87336953 | 2021-04-11 | 0 | 15 | 60784 | 1 | 2 |
51477 rows × 7 columns
# drop the old RFM_level column and rename Rank to RFM_level
df1_new = df1_new.drop('RFM_level', axis = 1)
df1_new = df1_new.rename(columns={'Rank':'RFM_level'})
df1_new['RFM_level'].dtype
dtype('int64')
# change data type
df1_new['RFM_level'] = df1_new['RFM_level'].astype('str')
# replace 0, 1, 2 with Low/Middle/High Value labels
level_map = {"0": "Low Value", "1": "Middle Value", "2": "High Value"}
df1_new['RFM_level'] = df1_new['RFM_level'].replace(level_map)
df1_new
|  | uid | last_date | R_score | F_score | M_score | RFM_level |
|---|---|---|---|---|---|---|
| 0 | 1000023 | 2021-04-02 | 9 | 1 | 240 | Low Value |
| 1 | 1000063 | 2021-04-06 | 5 | 1 | 8 | Low Value |
| 2 | 1000583 | 2021-03-29 | 13 | 1 | 8 | Low Value |
| 3 | 1000873 | 2021-03-28 | 14 | 1 | 8 | Low Value |
| 4 | 1000923 | 2021-03-12 | 30 | 2 | 96 | Low Value |
| ... | ... | ... | ... | ... | ... | ... |
| 51472 | 76662313 | 2021-04-11 | 0 | 29 | 61016 | High Value |
| 51473 | 77131663 | 2021-03-06 | 36 | 11 | 49264 | High Value |
| 51474 | 78431723 | 2021-03-11 | 31 | 10 | 49280 | High Value |
| 51475 | 80074703 | 2021-03-12 | 30 | 14 | 59008 | High Value |
| 51476 | 87336953 | 2021-04-11 | 0 | 15 | 60784 | High Value |
51477 rows × 6 columns
df2 = df1_new.groupby('RFM_level')['uid'].nunique().to_frame('unique user number')\
    .sort_values(by='unique user number', ascending=False).reset_index()
df2
|  | RFM_level | unique user number |
|---|---|---|
| 0 | Low Value | 51166 |
| 1 | Middle Value | 296 |
| 2 | High Value | 15 |
import plotly.express as px
fig = px.bar(df2, x='RFM_level', y='unique user number', text='unique user number', width=800, height=500)
fig.update_traces(
textposition='outside',
)
fig.show()
From the above bar chart, we can see that low value customers far outnumber middle value and high value customers.
# 3D Scatter Plot
import plotly.express as px
fig = px.scatter_3d(df1_new, x = 'R_score', y = 'F_score', z = 'M_score',
color='RFM_level')
fig.show()
By applying the K-Means machine learning algorithm within the RFM model, we find that the vast majority of customers (about 99%) are low value, falling in the \$0-\$19.9k spend group and the 0-19 purchase-frequency group. Differentiated, personalized marketing strategies can then be applied to each customer segment.